eNJine Logo

ENGINEERING A BETTER COMMUTE

If you have ever felt like New Jersey trains needed work, you are not wrong. In January 2019, stations across the New Jersey transit system experienced the equivalent of 614 days (about 1 year 8 months)’ worth of delays.

ALT TEXT

But what if you could know your train was running late before it was running late?

With eNJine, you can do just that. No more rushing, only to wait. No more missing your connection. No more being late. At eNJine, our data scientists are dedicated to engineering a better commute for you.

Our beta testing has focused on proving the feasibility of using machine learning to improve commuting. Currently, eNJine can predict whether trains will be departing within two minutes of their scheduled departure times two hours ahead.

Our model works because we understand how New Jersey transit works.

Below, you can see how train delays vary on days of the week and at certain stops. With this and other insight, we dial into exactly what causes New Jersey trains to be late, so you don’t have to.

Our Data

The beauty of our model is that it is simple. It can be easily replicated not only for other NJ Transit lines, but other transit systems worldwide. The model relies solely on data from NJ Transit service and it is the kind of data that transit systems likely already collect. From there, we engineered a few features that are powerful indicators of lateness. For example, the animations above shows how lateness varies not only at each stop, but also at certain times of day.

[INSERT MAP]

The engineers at eNJine have spent weeks working to better understand what makes NJ Transit’s commuter rails run late.

Below are a series of visualizations showing the way that lateness is distributed across different categories. Lateness varies across lines: the Bergan Co. Line is somewhat more likely than the Gladstone Branch Line to run late. Similarly, the plot of variation in lateness across every day of the week likewise shows that trains tend to be late more or less often on different days of the week and that in general trains run with fewer delays on weekends.

catvars<-panelnoGEOM%>% dplyr::select(y2, #from,
                             line,
                             weekday)%>%
  gather(Variable, value, -y2)%>%
  count(Variable, value, y2)%>%
  ggplot(., aes(value, n, fill=y2))+
  geom_bar(position="dodge", stat = "identity")+
  facet_wrap(~Variable, scales="free")+
  scale_fill_manual(values=palette2)+
  labs(x="y", y="Value",
       title = "Feature Associations with the Likelihood of a Train Late More Than 2 Minutes",
       subtitle = "All Catagorical Feats",
       caption= "Fig.2a")+
  theme(legend.position = "bottom", axis.text.x=element_text(angle=45, hjust=1))+plotTheme2()

catvars

numvars<- panelnoGEOM%>%dplyr::select(y2, total_departures,
                                        hour,
                                        onestoplag, 
                                        onestopearlieron,
                                        twostopearlieron,
                                        sqrtlag3Hours,
                                        lag3Hourstemp) %>%
  gather(Variable, value, -y2) %>%
  ggplot() + 
  geom_density(aes(value, color=y2), fill = "transparent") + 
  facet_wrap(~Variable, scales = "free") +
  scale_color_manual(values = palette2) +
  labs(title = "Feature Distributions with the Likelihood of a Train Late More Than 2 Minutes",
       subtitle = "All Numeric Outcomes",
       caption= "Fig.2b") +
  theme(legend.position = "bottom")+plotTheme2()

numvars

NEED TO ADD TEXT for d. What is the spatial or space/time process?

Our Approach

Our engineers experimented with a number of approaches before arriving at our final beta model. This model relies on a logistic regression to make binary predictions about whether trains, running on a particular line, within an “interval hours” will be more than two minutes late. We do this with data that is available two hours prior to the scheduled departure time.

Our engineers experimented with building separate models for different lines; ultimately, however, they determined it was possible to achieve similar results by continuing to operate with one model that includes lines as a categorical variable. In the future, our engineers may find it useful to return to this line-specific method: however, the two lines included in our proof-of-concept work operate similarly enough that this is not necessary.

In order to determine what numeric variables would be most useful for the model, our engineers relied on a correlation plot that demonstrates the relationship between various features.

numericVars <- panelnoGEOM%>%dplyr::select(all_of(Model1.vars), y2_numeric)%>%
  select_if(is.numeric) %>% na.omit()

corrplot <- ggcorrplot( 
  round(cor(numericVars), 1), 
  p.mat = cor_pmat(numericVars),
  colors = c("#260a03", "white", "#a81010"),
  type="lower",
  insig = "blank",
  lab = TRUE) +  
    labs(title = "Correlation Across Numeric Variables") 

corrplot 

This plot allowed our researchers to determine not only which variables were correlated with lateness, but also give them insight into the interactions between variables. This proved especially useful in choosing which variables to include as our engineers were limited in the computing power available to them and so built a model using fewer features than they might have otherwise.

To determine what categorical features to include, our researchers relied on their initial exploratory analysis, included above.

With more investment and computing power, we will be able to improve the accuracy of our model by increasing the specificity of inputs: i.e., building a model that takes the direction a train is travelling into account as well as iterating over shorter time intervals.

How We Perform

thresh <- ModelProbs1.thresholds%>%filter(Threshold == .25)%>%dplyr::select(contains("Rate"), Accuracy)%>%gather(Variable, Value)

resultsbar <- thresh%>%
    ggplot(aes(Variable, Value), fill = "#260a03",  colour= "white") +
      geom_bar( position = "dodge", stat = "identity", fill = "#260a03") +
      labs(title="Model Results",
           #subtitle = " ", 
           x = "Outcome",y = "Rate") +
      plotTheme2() + theme(axis.text.x = element_text(angle = 45, hjust = 1), legend.position = "bottom") 

resultsbar
## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database
## Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database

## Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database

## Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database

## Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database

## Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database

## Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database

## Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database

## Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database

## Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database
## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database
## Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database
## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database

By cross-validating, our engineers were able to determine that the beta-model currently correctly predicts whether a train will be late 93 percent of the time. We think that this is pretty good for a beta model. It is, however, worth noting that currently our model is better at predicting when a train is going to be on time than when it is going to be late.

ModelcvFit1.plot<- dplyr::select(ModelcvFit1$resample, -Resample) %>%
  gather(metric, value) %>%
  left_join(gather(ModelcvFit1$results[2:4], metric, mean)) %>%
  ggplot(aes(value)) + 
  geom_histogram(bins=35, fill = "#93afdc") +
  facet_wrap(~metric) +
  geom_vline(aes(xintercept = mean), colour = "#a81010", linetype = 3, size = 1.5) +
  scale_x_continuous(limits = c(0, 1)) +
  labs(x="Goodness of Fit", y="Count", title="Model 1 Goodness of Fit Metrics",
       subtitle = "Across-fold mean reprented as dotted lines",
       caption= " ") +
  plotTheme2()

ModelcvFit1.plot

Of course, for eNJine to be useful to commuters, it is imperative not only that our engineers have the resources to minimize errors, but also that we balance their directionality when they do occur. If we fail to do this well, eNJine won’t be useful to commuters. Even minor errors can result in commuters missing their trains.

Michael Scott struggles to get on a moving freight train

There are various ways to envision the tradeoffs of a logistic model: below we have inculde both a distribution of probabilities as well as an ROC curve. We have set our threshold to balance the ratio of true positives and true negatives that the model predicts.

 model1plot<- ggplot(ModelProbs1, aes(x = Probs, fill = as.factor(Outcome))) + 
  geom_density() +
   geom_vline(xintercept = 0.25,  
                color = "#93afdc", size=2, alpha=0.5)+
  facet_grid(Outcome ~ .) +
  scale_fill_manual(values = palette2) +
  labs(x = "Late Over 2 Mins", y = "Density of probabilities",
       title = "Model 1: Distribution of Predicted Probabilities by Actual Outcome",
       subtitle = "Threshold: 0.25", 
       caption= " ") +
  theme(strip.text.x = element_text(size = 18),
        legend.position = "none")+plotTheme2()


Modelroc.1<- ggplot(ModelProbs1, aes(d = as.numeric(Outcome), m = Probs)) +
  geom_roc(n.cuts = 50, labels = FALSE, colour = "#a81010") +
     geom_vline(xintercept = 0.25,  
                color = "#93afdc", size=2, alpha=0.5)+
  style_roc(theme = theme_grey) +
  geom_abline(slope = 1, intercept = 0, size = 1.5, color = 'grey') +
  labs(title = "ROC Curve - Model 1", subtitle = paste("AUC:", round(Modelauc1[1], digits=3)) )+plotTheme2()


plot_grid(model1plot, Modelroc.1, align = "h")

Our engineers are always making our model better.

For this proof-of-concept report, our scope was limited to just two lines on New Jersey Transit: the Bergen Co. Line and the Gladstone Branch. Obviously, this tool is only helpful if you’re a commuter on one of those two lines, so in future iterations of this project it would be wise to include all the lines.

In terms of powerful features, we were surprised by how little weather impacted the models. It would be useful to identify other external factors that impact train delays. Our engineers are interested in including public health data in future versions of the model, as they believe this may act as a useful proxy for worker absenteeism (likewise: drinking holidays). Future models may also be improved by incorporating information about specific models of trains as well as planned maintenance and service disruptions.

Additionally, our current product incorporates a few different lag features, one of which is based on whole hour intervals. We recognize that that may not be the best approach in engineering a time-based lag for New Jersey Transit, but we simply lack the computing power to model shorter intervals. Similarly, we were unable to incorporate a significant route direction feature, NJ train interaction with freight lines, and employee attendance.

Code Library

# Libraries -----------------
library(tidyr)
library(dplyr)
library(lubridate)
library(sf)
library(tidycensus)
library(tidyverse)
library(tigris)
library(riem)
library(gridExtra)
library(knitr)
library(kableExtra)
library(caret)
library(knitr) 
library(pscl)
library(plotROC)
library(gtsummary) #for table cause broom always f's everything up
library(pROC)
library(tibble)
library(cowplot)
## FOR MAPS ~~
library(RColorBrewer)
library(ggmap)  
library(maps)
library(mapdata)
library(ggcorrplot)



options(tigris_class = "sf")
source("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/functions.r")


palette6 <- c("#260a03", "#a81010", "#f8cd12", "#f2fdff", "#93afdc", "#0c0e1d" )

palette5 <- c("#260a03", "#a81010", "#f8cd12", "#f2fdff", "#93afdc", "#0c0e1d" )

palette4 <- c("#a81010", "#f8cd12", "#93afdc", "#0c0e1d" )

palette3 <- c("#a81010", "#f8cd12", "#93afdc")

palette2 <- c("#260a03", "#a81010")

palette2b <- c( "#260a03", "#f8cd12")

plotTheme2 <-function(base_size = 12, title_size = 18) {
  theme(
    panel.background =element_blank(), 
    text = element_text( color = "#0c0e1d", family="Helvetica"),
    plot.title = element_text(size = title_size, colour = "#0c0e1d", family="Helvetica", face="bold"), 
    plot.subtitle = element_text(face="italic", family="Helvetica", colour = "#0c0e1d"),
    plot.caption = element_text(hjust=0, colour = "#0c0e1d", family="Helvetica"),

    axis.ticks.y = element_line(color = "grey80", size = 0.1),
    plot.background =element_blank(),
    
    panel.grid.major.x   = element_line("grey80", size =0.1 ),
    panel.grid.major.y   =element_line("grey80", size = 0.05),
    panel.grid.minor = element_blank(),
    panel.border = element_rect(colour = "#93afdc", fill=NA, size=4),
    #panel.spacing = unit(c(2,2,2,2), "cm"),
    strip.background = element_rect(fill = "#93afdc", color = "#93afdc"),
    strip.text = element_text(size=12,color ="#0c0e1d", family="Helvetica"),
    axis.title = element_text(size=12, color ="#0c0e1d", family="Helvetica",  face="italic"),
    axis.text = element_text(size=10, color ="#0c0e1d", family="Helvetica",),
    
    legend.background = element_blank(),
    legend.title = element_text(colour = "#0c0e1d", face = "italic", family="Helvetica"),
    legend.text = element_text(colour = "#0c0e1d", face = "italic", family="Helvetica"),
    strip.text.x = element_text(size = 14, family="Helvetica", colour = "#0c0e1d")
  )
}


plotTheme3 <-function(base_size = 12, title_size = 18) {
  theme(
    panel.background=element_rect(fill = "#93afdc", color = "#f2fdff"), 
    text = element_text( color = "#0c0e1d", family="Helvetica"),
    plot.title = element_text(size = title_size, colour = "#0c0e1d", family="Helvetica", face="bold"), 
    plot.subtitle = element_text(face="italic", family="Helvetica", colour = "#0c0e1d"),
    plot.caption = element_text(hjust=0, colour = "#0c0e1d", family="Helvetica"),

    axis.ticks.y = element_line(color = "grey80", size = 0.1),
    plot.background =element_blank(),
    
    panel.grid.major.x   = element_line("grey80", size =0.1 ),
    panel.grid.major.y   =element_line("grey80", size = 0.05),
    panel.grid.minor = element_blank(),
    panel.border =element_blank(),
    #panel.spacing = unit(c(2,2,2,2), "cm"),
    strip.background = element_rect(fill = "#f2fdff", color = "#f2fdff"),
    strip.text = element_text(size=12,color ="#0c0e1d", family="Helvetica"),
    axis.title = element_text(size=12, color ="#0c0e1d", family="Helvetica",  face="italic"),
    axis.text = element_text(size=10, color ="#0c0e1d", family="Helvetica",),
    
    legend.background = element_blank(),
    legend.title = element_text(colour = "#0c0e1d", face = "italic", family="Helvetica"),
    legend.text = element_text(colour = "#0c0e1d", face = "italic", family="Helvetica"),
    strip.text.x = element_text(size = 14, family="Helvetica", colour = "#f2fdff")
  )
}



mapTheme2<-function(base_size = 12, title_size = 16) {
  theme(
    text = element_text( color = "#0c0e1d", family="Helvetica"),
    plot.title = element_text(size = title_size,colour = "#0c0e1d", family="Helvetica"),
    plot.subtitle=element_text(face="italic", family="Helvetica"),
    plot.caption=element_text(hjust=0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),axis.title = element_blank(),
    axis.text = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    panel.grid.minor = element_blank(),
    panel.border = element_rect(colour = "#93afdc", fill=NA, size=2),
    strip.text.x = element_text(size = 14, color="#0c0e1d", family="Helvetica"),
    strip.background  = element_rect(fill="#93afdc", color="#93afdc"))
}













# RawData -----------
root.dir = "https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA/"
jan2019 <- read.csv("https://raw.githubusercontent.com/bri-ne/MUSA508-Final/main/data/2019_01.csv?token=AVOS24NFX2PEGNUZQUPRNY3BXYPQY")
feb2019 <- read.csv("https://raw.githubusercontent.com/bri-ne/MUSA508-Final/main/data/2019_02.csv?token=AVOS24PUUJDM2PW36N5T6K3BXYPQ4")
mar2019 <- read.csv("https://raw.githubusercontent.com/bri-ne/MUSA508-Final/main/data/2019_03.csv?token=AVOS24MRII4RKEMYGNWWUSDBXYPQ6")
weather.Data <- 
  riem_measures(station = "EWR", date_start = "2019-01-01", date_end = "2019-05-01")





# Warngledate ----
njtransitdf <- rbind(jan2019, feb2019, mar2019)


njtransitdf <- njtransitdf %>% 
  mutate(scheduled_time_dt = as_datetime(scheduled_time),
         actual_time_dt = as_datetime(actual_time),
         delay_dt = seconds(as.integer(delay_minutes*60)), 
         intervalhour = floor_date(scheduled_time_dt, unit= "hour"),
         week = week(intervalhour)) %>%
  na.omit(delay_dt)







# Stations ----

stoplocations <- read.csv("https://raw.githubusercontent.com/bri-ne/MUSA508-Final/main/Rail_Stations_of_NJ_Transit.csv?token=AVSVPEVU7X6RXZP3RNOK2CLBW7VZO")

stoplocations <- stoplocations %>%rename(X = "ï..X")%>%st_as_sf(coords = c("X", "Y"), crs =3424, agr = "constant") 

#stoplocations <- stoplocations %>%st_as_sf(coords = c("X", "Y"), crs =3424, agr = "constant") 







# StudyPanel ----

train.template <- 
  filter(njtransitdf, week %in% c(1: 10)) %>%
  filter(line == "Bergen Co. Line " | line == "Gladstone Branch")%>%
  mutate(train_counter=1)%>%
  group_by(line, intervalhour, from) %>%
  summarize(train_count = sum(train_counter, na.rm=T),
            late = mean(delay_dt))

study.panel <- 
  expand.grid(intervalhour = unique(train.template$intervalhour), 
              from = unique(train.template$from),
              line= unique(train.template$line))





# TrainPanel ----


train.panel <- train.template %>%
  right_join(study.panel)





# Weather  ----

weather.Panel <-  
  weather.Data %>%
    mutate_if(is.character, list(~replace(as.character(.), is.na(.), "0"))) %>% 
    replace(is.na(.), 0) %>%
    mutate(intervalhour = ymd_h(substr(valid, 1, 13))) %>%
    mutate(dotw = wday(intervalhour, label=TRUE)) %>%
    group_by(intervalhour) %>%
    summarize(Temperature = max(tmpf),
              Percipitation = sum(p01i),
              Wind_Speed = max(sknt)) %>%
    mutate(Temperature = ifelse(Temperature == 0, 42, Temperature)) 

train.weather.panel <- merge(x=train.panel, y= weather.Panel, by= 'intervalhour', all.x= T)


train.weather.panel[c("late","train_count")][is.na(train.weather.panel[c("late","train_count")])] <- 0 






# Building Time Lags ----


lag.df <- train.weather.panel %>%
  group_by(line, intervalhour) %>%
  summarise(avg_late = mean(late),
            total_departures = sum(train_count),
            Temperature = mean(Temperature), 
            Percipitation = mean(Percipitation),
            Wind_Speed = mean(Wind_Speed)) %>%
  mutate(lag2Hours = dplyr::lag(avg_late,2),
         lag3Hours = dplyr::lag(avg_late,3),
         lag4Hours = dplyr::lag(avg_late,4),
         lag12Hours = dplyr::lag(avg_late,12),
         lag1day = dplyr::lag(avg_late,24),
         lag1week = dplyr::lag(avg_late, 168), 
         sqrtlag2Hours = sqrt(lag2Hours),
         sqrtlag3Hours = sqrt(lag3Hours),
         sqrtlag4Hours = sqrt(lag4Hours),
         sqrtlag12Hours = sqrt(lag12Hours),
         sqrtlag1day = sqrt(lag1day),
         sqrtlag1week = sqrt(lag1week),
         sqrtlag2hourstotal_departures = sqrt(dplyr::lag(total_departures)),
         lag2Hourswind = dplyr::lag(Wind_Speed,2),
         lag3Hourswind = dplyr::lag(Wind_Speed,3),
         lag4Hourswind = dplyr::lag(Wind_Speed,4),
         lag12Hourswind = dplyr::lag(Wind_Speed,12),
         lag3Hourstemp = dplyr::lag(Temperature,3),
        lag3HoursPercipitation = dplyr::lag(Percipitation,3))

train.weather.panel.lags = merge(train.weather.panel, lag.df, by=c("line", "intervalhour"))

train.weather.panel.lags[c("lag2Hours","lag3Hours","lag4Hours", "lag12Hours", "lag1day", "lag1week",
                           "sqrtlag2Hours", "sqrtlag3Hours", "sqrtlag4Hours", "sqrtlag12Hours", "sqrtlag1day", "sqrtlag1week", "sqrtlag2hourstotal_departures", "lag2Hourswind", "lag3Hourswind", "lag4Hourswind", "lag12Hourswind", "lag3Hourstemp", "lag3HoursPercipitation")][is.na(train.weather.panel.lags[c("lag2Hours","lag3Hours","lag4Hours", "lag12Hours", "lag1day", "lag1week", "sqrtlag2Hours", "sqrtlag3Hours", "sqrtlag4Hours", "sqrtlag12Hours", "sqrtlag1day", "sqrtlag1week", "sqrtlag2hourstotal_departures", "lag2Hourswind", "lag3Hourswind", "lag4Hourswind", "lag12Hourswind", "lag3Hourstemp", "lag3HoursPercipitation")])] <- 0 





# Space Lags ----

stoplag<- njtransitdf %>%
 filter(week %in% c(1: 10)) %>%
 filter(line == "Bergen Co. Line " | line == "Gladstone Branch")%>%
 group_by(line, train_id)%>%arrange(stop_sequence)%>%
  mutate(onestoplag = dplyr::lag(delay_minutes),
         twostoplag = dplyr::lag(delay_minutes,2)) %>%
  ungroup() %>%
  group_by(line, from, intervalhour) %>%
  summarise(onestoplag = mean(onestoplag), twostoplag=mean(twostoplag)) %>%
  mutate(onestopearlieron = dplyr::lag(onestoplag,1), twostopearlieron = dplyr::lag(twostoplag,2))%>%
  dplyr::select(intervalhour, line, from, onestoplag, twostoplag, onestopearlieron, twostopearlieron) 

stoplag[c("onestoplag", "twostoplag", "onestopearlieron", "twostopearlieron")][is.na(stoplag[c("onestoplag", "twostoplag", "onestopearlieron", "twostopearlieron")])] <- 0

train.weather.panel.lags.final <- left_join(train.weather.panel.lags, stoplag, by=c("line", "intervalhour", "from"))






# Final Setup ----

panelnoGEOM <- train.weather.panel.lags.final %>% 
  mutate(y2_numeric = ifelse(late < minutes(2), 0, 1),
         y2 = ifelse(late < minutes(2), "no", "yes"),
         lateM = late/60,
         hour= hour(intervalhour),
         daynumeric= wday(intervalhour),
         week = week(intervalhour),
         weekday = weekdays(intervalhour),
         y5_numeric = ifelse(late < minutes(5), 0, 1),
         y5 = ifelse(late < minutes(5), "no", "yes"),
         week_numeric = as.numeric(week))

panelnoGEOM <- panelnoGEOM %>%
  mutate(STATION_ID = from) 

panelGEOM <- left_join(panelnoGEOM, stoplocations) %>%
  st_as_sf(crs =3424, agr = "constant")



panelnoGEOM[c("onestoplag", "twostoplag", "onestopearlieron", "twostopearlieron")][is.na(panelnoGEOM[c("onestoplag", "twostoplag", "onestopearlieron", "twostopearlieron")])] <- 0


set.seed(2121)
panelnoGEOMTrain <- filter(panelnoGEOM, week < 9) #### BRI, YOU MUST CHANGE THIS BACK TO A 9 
panelnoGEOMTest <- filter(panelnoGEOM, week >= 9)







## Model ----

Model1<- glm(y2_numeric ~ .,
                        data=panelnoGEOMTrain %>% 
                          dplyr::select(y2_numeric,
                                        total_departures,
                                        #from,
                                        hour,
                                        line,
                                        #lag1week,
                                        weekday,
                                        #lag1day,
                                        #toPenn.binary, 
                                        #tuesWed.binary, 
                                        #friWed.binary,
                                        #toSigStop,
                                        onestopearlieron,
                                        twostopearlieron,
                                        onestoplag,
                                        #sqrtlag2Hours,
                                        sqrtlag3Hours,
                                        #sqrtlag4Hours,
                                        #sqrtlag1week,
                                        lag3Hourstemp),  
                        family="binomial" (link="logit"))







## Cross Validation ----

ctrl <- trainControl(method = "cv", number = 100, classProbs=TRUE, summaryFunction=twoClassSummary)

##### Model 1-


Model1.vars <- colnames(panelnoGEOMTrain %>% 
                          dplyr::select(total_departures,
                                        #from,
                                        hour,
                                        line,
                                        #lag1week,
                                        weekday,
                                        #lag1day,
                                        #toPenn.binary, 
                                        #tuesWed.binary, 
                                        #friWed.binary,
                                        #toSigStop,
                                        onestoplag,
                                        onestopearlieron,
                                        twostopearlieron,
                                        #sqrtlag2Hours,
                                        sqrtlag3Hours,
                                        #sqrtlag4Hours,
                                        #sqrtlag1week,
                                        lag3Hourstemp))

ModelcvFit1 <- caret::train(y2 ~ .,
                        data=panelnoGEOMTrain %>% 
                          dplyr::select(all_of(Model1.vars), y2), #na.action = na.pass,
                        method="glm", family="binomial",
                        metric="ROC", trControl = ctrl)






## Predictions ----


ModelProbs1  <- data.frame(Outcome = as.factor(panelnoGEOMTest$y2_numeric),
                         Probs = predict(Model1, panelnoGEOMTest, type="response"))


ModelProbs1.thresholds <- 
  iterateThresholds(data=ModelProbs1, observedClass = Outcome, 
                    predictedProbs = Probs)

ModelProbs1.thresholds<-ModelProbs1.thresholds%>%arrange(Rate_TP,Rate_TN) ## .30

modeltestProbs1 <- 
  ModelProbs1 %>%
  mutate(predOutcome  = as.factor(ifelse(ModelProbs1$Probs > .25 , 1, 0)))


Modelauc1<-pROC::auc(ModelProbs1$Outcome, ModelProbs1$Probs) #0.9501

I have not included this yet. trying to figure out the best way to get the maps in here.